import pandas as pd
import matplotlib.pyplot as plt
import folium
df = pd.read_csv('https://data.humdata.org/dataset/85ee8e10-0c66-4635-b997-79b6fad44c71/resource/ec896b64-c922-4737-b759-e4bd7f73b8cc/download/social_capital_county.csv')Methods
Exploration
Here we’ll be exploring summary information about the data, summary statistics, identify any missing/incorrect/outlier data, and conduct data preprocessing if necessary.
We’ll also cover 4 exploratory data visualizations to better understand its structure, contents, and whether the data is clean enough to continue into the analysis.
df| county | county_name | num_below_p50 | pop2018 | ec_county | ec_se_county | child_ec_county | child_ec_se_county | ec_grp_mem_county | ec_high_county | ... | child_exposure_county | child_high_exposure_county | bias_grp_mem_county | bias_grp_mem_high_county | child_bias_county | child_high_bias_county | clustering_county | support_ratio_county | volunteering_rate_county | civic_organizations_county | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Autauga, Alabama | 5922.39210 | 55200.0 | 0.72077 | 0.00831 | 1.11754 | 0.02467 | 0.77223 | 1.21372 | ... | 1.14816 | 1.19944 | 0.05526 | -0.22748 | 0.02668 | -0.08229 | 0.10347 | 0.98275 | 0.04355 | 0.01518 |
| 1 | 1003 | Baldwin, Alabama | 15458.39600 | 208107.0 | 0.74313 | 0.00661 | 0.83064 | 0.01629 | 0.76215 | 1.28302 | ... | 0.84588 | 1.00797 | 0.02950 | -0.21519 | 0.01802 | -0.05241 | 0.09624 | 0.98684 | 0.06117 | 0.01526 |
| 2 | 1005 | Barbour, Alabama | 4863.97360 | 25782.0 | 0.41366 | 0.00978 | 0.58541 | 0.02707 | 0.35927 | 0.91897 | ... | 0.63306 | 0.71967 | 0.13457 | -0.34086 | 0.07528 | -0.19714 | 0.14911 | 0.99911 | 0.02093 | 0.01474 |
| 3 | 1007 | Bibb, Alabama | 3061.49340 | 22527.0 | 0.63152 | 0.01175 | 0.72265 | 0.03027 | 0.68094 | 1.06378 | ... | 0.71433 | 0.72395 | 0.04108 | -0.27727 | -0.01165 | -0.15993 | 0.14252 | 0.99716 | 0.05294 | 0.01439 |
| 4 | 1009 | Blount, Alabama | 6740.91160 | 57645.0 | 0.72562 | 0.00985 | 0.76096 | 0.02466 | 0.79584 | 1.10569 | ... | 0.74821 | 0.79375 | 0.00217 | -0.24946 | -0.01704 | -0.08745 | 0.11243 | 0.99069 | 0.05704 | 0.01724 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3084 | 56037 | Sweetwater, Wyoming | 2402.96900 | 44117.0 | 0.96235 | 0.01280 | 1.14781 | 0.02794 | 1.13449 | 1.32399 | ... | 1.12164 | 1.12907 | 0.09519 | -0.12030 | -0.02333 | -0.08683 | 0.10809 | 0.99710 | 0.07321 | 0.01225 |
| 3085 | 56039 | Teton, Wyoming | 783.24982 | 23059.0 | 1.07623 | 0.01744 | 1.23113 | 0.04692 | 1.13296 | 1.63551 | ... | 1.32874 | 1.35341 | 0.14337 | -0.11958 | 0.07346 | -0.07364 | 0.09253 | 0.98648 | 0.09747 | 0.03223 |
| 3086 | 56041 | Uinta, Wyoming | 2174.06180 | 20609.0 | 0.95452 | 0.01404 | 1.04595 | 0.03455 | 0.92831 | 1.32040 | ... | 1.05446 | 1.06284 | 0.13816 | -0.12194 | 0.00808 | -0.06074 | 0.11204 | 0.99479 | 0.06942 | 0.01222 |
| 3087 | 56043 | Washakie, Wyoming | 872.51544 | 8129.0 | 0.90667 | 0.01928 | 0.90794 | 0.04962 | 0.78223 | 1.29208 | ... | 0.88480 | 0.88589 | 0.06667 | -0.20435 | -0.02615 | -0.06076 | 0.11592 | 0.99708 | 0.05843 | 0.03512 |
| 3088 | 56045 | Weston, Wyoming | 635.28436 | 7100.0 | 0.97840 | 0.02036 | 1.09118 | 0.05823 | 0.93135 | 1.28553 | ... | 1.03325 | 1.05526 | 0.02279 | -0.17229 | -0.05606 | -0.04609 | 0.11927 | 0.99730 | 0.13635 | 0.02375 |
3089 rows × 26 columns
Summary Information
Every county that had statistics via Facebook data were included in the dataset. While there are many columns, many of them will not be used for the final analysis since they are irrelevant to the research questions. Here is a summary of what columns this project will utilize, as well as explanation for why this project will not utilize the others.
ec_county - This measures the level of economic connectedness within a county. The equation for this is two times the share of high-SES (social economic status) friends amongst low-SES individuals and then averaged. This equation is from the research journal “Social capital I: Measurement and Associations with Economic Mobility.”
ec_high_county - This measures the connectedness between high-SES individuals solely, once again with a value that is averaged. The main difference is that this does not include how connected they are with people from lower economic statuses.
exposure_grp_mem_county - This measures how often high-SES individuals are exposed to low-SES individuals with the same formula of from ec_county.
exposure_grp_mem_high_county - This measures how often high-SES individuals are exposed to only high-SES individuals, once again measured with a similar formula.
clustering_county - This calculates the average fraction of an individual’s friend pairs who are also friends with each other including people only within the relevant county.
volunteering_rate_county - The percentage of Facebook users within a county that are predicted to be members of a volunteering or activism group. Secret groups or large groups that Facebook identified as “clearly misclassified” were not included in this calculation.
There are many other columns. One of the main categories include those that revolve around childhood connectedness specifically. These are not calculated because the research question aims to address all populations. Additionally, the two categories, the support_ratio_county, which measures the proportion of people who share a third mutual friend, and so was civic_organizations_county, which measures the number of Facebook Pages predicted to be “Public Good” within a county, are not included as they would not address the research questions effectively. As part of the cleaning process, these categories will be removed. The standard error colums will be kept, as they may need to be referred to in the conclusion.
Summary Statistics
columns_removed = ['num_below_p50', 'pop2018', 'child_ec_county', 'child_ec_se_county', 'ec_grp_mem_county', 'child_high_ec_county', 'child_high_ec_se_county', 'ec_grp_mem_high_county','child_exposure_county','child_high_exposure_county','bias_grp_mem_county','bias_grp_mem_high_county','child_bias_county','child_high_bias_county', 'support_ratio_county', 'civic_organizations_county']
df = df.drop(columns=columns_removed)
column_max = df.max()
column_min = df.min()
column_stats_df = pd.DataFrame({'Column': df.columns, 'Minimum': column_min, 'Maximum': column_max})
print(column_stats_df) Column \
county county
county_name county_name
ec_county ec_county
ec_se_county ec_se_county
ec_high_county ec_high_county
ec_high_se_county ec_high_se_county
exposure_grp_mem_county exposure_grp_mem_county
exposure_grp_mem_high_county exposure_grp_mem_high_county
clustering_county clustering_county
volunteering_rate_county volunteering_rate_county
Minimum Maximum
county 1001 56045
county_name Abbeville, South Carolina Ziebach, South Dakota
ec_county 0.29469 1.3597
ec_se_county 0.00436 0.05023
ec_high_county 0.70062 1.71507
ec_high_se_county 0.00475 0.05099
exposure_grp_mem_county 0.2552 1.48628
exposure_grp_mem_high_county 0.51013 1.66616
clustering_county 0.07162 0.26097
volunteering_rate_county 0.00965 0.308736
This is the following cell range for each column:
ec_county - 0.29469 - 1.3597. The lower the number is, the less connectedness there is between high-SES and low-SES individuals. The standard deviation ranges from around 0.004-0.05 per county.
ec_high_county - 0.70062 - 1.71507. The lower the number is, the less connectedness there is between high-SES individuals. The standard deviation for this ranges from 0.004-0.05 per county, similar to the original ec_county statistic.
exposure_grp_mem_county - 0.2552-1.48628. The lower the number is, the less exposure people of high-SES to low-SES individuals.
exposure_grp_mem_high_county - 0.51013-1.66616. The lower the number is, the less exposure high-SES people have to other high-SES individuals.
clustering_county - 0.07162-0.26097. The lower the number equates to a lower amount of mutual friend circles within the county.
volunteering_rate_county - 0.00965 - 0.308736. The lower the number is, the less volunteers the county has.
We will also note the mean (average) values for future reference.
mean_values = df.mean(numeric_only=True)
print(mean_values)county 30218.783101
ec_county 0.814464
ec_se_county 0.013409
ec_high_county 1.252636
ec_high_se_county 0.014754
exposure_grp_mem_county 0.906089
exposure_grp_mem_high_county 1.078581
clustering_county 0.116456
volunteering_rate_county 0.078068
dtype: float64
Data Exploration
Before analyzing the data, we should identify and understand the dataset on these following bases.
1) Is there any relationship between economic connectedness between low and high SES and just high SES? If so, what’s the pattern? Do those categories seem to have a balanced distribution, or are there more counties that are on the lower or higher end of the spectrum?
2) What do the levels of clustering_county, or fraction of friend_pairs, look like around the country? Is there any clear average?
3) What do the levels of volunteering rates look like around the country? Is there any clear average?
#### Data Exploration #1
df.plot.scatter(x='ec_county', y='ec_high_county')
plt.xlabel("Economic Connectedness between low and high SES")
plt.ylabel("Connectedness within high SES")
plt.title("Relationship between Economic Connectedness of Low and High SES and Economic Connectedness within High SES")
plt.show()
This scatterplot aims to recognize if there’s any correlation between connectedness between low and high SES and just high SES individuals.
It seems there is a somewhat balanced, upward linear pattern between the two indicators. For the most part, the higher the economic connectedness between low and high SES indicates it’ll also be higher between those in just higher SES. There are some outliers, but none that indicate something incredibly wrong with the data. It tends to fall within the standard deviation predicted above. It is important to note that most of the outliers tend to fall with higher scores on the higher connectedness within just high-SES individuals, indicating that many counties do have a tendency to prioritize connectedness within one’s own social class rather than a variety. It’ll be interesting to see if this correlates with friendship circles as we get into the analysis.
While it looks like the economic connectedness is relatively evenly distributed country-wide, a histogram should be conducted to double check on the categories.
#### Data Exploration #2
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(df['ec_county'], bins=10, range=(0.2, 1.4), edgecolor='black')
plt.xlabel("Economic Connectedness per ec_county")
plt.ylabel("Frequency")
plt.title("Histogram of Economic Connectedness Between Low and High-SES")
plt.subplot(1, 2, 2)
plt.hist(df['ec_high_county'], bins=10, range=(0.7, 1.7), edgecolor='black')
plt.xlabel("Economic Connectedness per ec_high_county")
plt.ylabel("Frequency")
plt.title("Histogram of Economic Connectedness for only High-SES")
plt.subplots_adjust(wspace=0.4)
plt.show()
The original prediction of even distribution across the economic connectedness scores is incorrect.
The economic connectedness between low and high SES appears to be on somewhat a bell-curve, which means that there is higher frequency in the mid-range of the distribution with a score of around 0.8. This also happens to be the average or mean score. In simpler terms, this means that counties tend to have a moderate level of connectedness between people of different socioeconomic status.
On the other hand, the economic connectedness amongst only those of high-SES is slightly skewed to the higher range of the data. This means that those within the higher socioeconomic category tend to have stronger connections. This might also explain why in the scatterplot most of the deviations were indicated higher connectedness with only high-SES individuals.
#### Data Exploration #3
def extract_county_name(name): ## couldn't find a .geojson that had the counties lists as (County Name, State Name), like the dataset, hence the function
parts = name.split(',')
if len(parts) > 1:
return parts[0].strip()
else:
return name
df['county_name'] = df['county_name'].apply(extract_county_name)
m = folium.Map(location=[36, -98], zoom_start=4)
county_geo_url = "https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json"
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'clustering_county'],
key_on='feature.properties.NAME',
fill_color='YlGnBu',
).add_to(m)
mIn the main parts of the United States, most counties tend to be yellow or a ligher green, with some even getting into an aquamarine panel. Nevertheless, it’s important to note that the lighter green demonstrates relatively low clusters. Another histogram will be essential to confirm this.
As an interesting note, clusters are extremely frequent in Alaskan counties with many of the counties being aquamarine, medium blue, or dark blue. This might be because the population in Alaska is lower, so communities may be more tightly knit.
#### Data Exploration #4
df.hist(column='clustering_county')
plt.xlabel("Frequency of Clusters / Mutual Connections within every County")
plt.ylabel("Frequency")
plt.title("Frequency of Relationship Cohesiveness within every County")
plt.show()
As predicted, clusters tend to be less frequent within each county, with the histogram heavily skewed to the left.
Data Exploration #5
m = folium.Map(location=[36, -98], zoom_start=4)
county_geo_url = "https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_20m.json"
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'volunteering_rate_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
).add_to(m)
mThis map seems to be more evenly distributed. Many of the states seem to be yellow to light orange, with a few noticeable ones in the reds. However, an important note is that the west (except for California) midwest, and northeast tend to have more counties in the orange range. Alaska and Hawaii entirely seem to be orange to dark orange as well.
This should also be confirmed with a histogram, but from this map it seems that the rate will be similarly skewed as the cluster and cohesiveness rate.
#### Data Visualization #6
df.hist(column='volunteering_rate_county')
plt.xlabel("Frequency of Clusters / Mutual Connections within every County")
plt.ylabel("Frequency")
plt.title("Frequency of Relationship Cohesiveness within every County")
plt.show()
The above prediction was correct.
Data Cleaning
As a final step to the process, we will do some final data cleaning. There is no huge outlier, so worrying about that is no problem. However, as identified in the maps, there are some counties with no data. We will first identify whether there are any rows with missing values - if not, this would mean that the values did not correctly match up with the .geojson.
missing_values = df[df.isnull().any(axis=1)]
print(missing_values) county county_name ec_county ec_se_county ec_high_county \
52 1105 Perry NaN NaN NaN
71 2060 Bristol Bay NaN NaN NaN
82 2164 Lake and Peninsula NaN NaN NaN
186 6003 Alpine NaN NaN NaN
255 8023 Costilla NaN NaN NaN
... ... ... ... ... ...
2707 48433 Stonewall NaN NaN NaN
2713 48447 Throckmorton NaN NaN NaN
2748 49009 Daggett NaN NaN NaN
2759 49031 Piute NaN NaN NaN
2831 51091 Highland NaN NaN NaN
ec_high_se_county exposure_grp_mem_county \
52 NaN NaN
71 NaN NaN
82 NaN NaN
186 NaN NaN
255 NaN NaN
... ... ...
2707 NaN NaN
2713 NaN NaN
2748 NaN NaN
2759 NaN NaN
2831 NaN NaN
exposure_grp_mem_high_county clustering_county \
52 NaN 0.198011
71 NaN 0.133101
82 NaN 0.156810
186 NaN 0.091637
255 NaN 0.100376
... ... ...
2707 NaN 0.133058
2713 NaN 0.105239
2748 NaN 0.094783
2759 NaN 0.123849
2831 NaN 0.152162
volunteering_rate_county
52 0.031368
71 0.042672
82 0.086069
186 0.146537
255 0.062568
... ...
2707 0.033803
2713 0.244292
2748 0.026229
2759 0.030337
2831 0.188111
[77 rows x 10 columns]
While the ec_county is missing, none of the values are missing from clustering_county or volunteering_rate_county. This means certain values did not match up with the .geojson file. We will check for these values.
import requests
response = requests.get(county_geo_url)
county_geo_data = response.json()
geojson_names = set(feature['properties']['NAME'] for feature in county_geo_data['features'])
df_names = set(df['county_name'])
missing_names = df_names - geojson_names
print("Missing county names:", missing_names)Missing county names: {'St. Louis County', 'Franklin County', 'St. Louis City', 'Richmond City', 'Roanoke City', 'Baltimore City', 'Baltimore County', 'Fairfax County', 'Ste Genevieve', 'Roanoke County', 'Richmond County', 'Bedford County', 'DoÃ\x83±a Ana'}
These are the county names that weren’t addressed in the .geojson.
Many of these have different names in the main dataset. These seem to be the common errors. For example, Richmond County is just named Richmond.
1) Counties like “Richmond County” that have no city are simply just named “Richmond” in the dataset. Some also have periods, such as “Ste. Genevieve.”
2) Differentiating between a city and county is not indicated in the name of the geojson.
We will fix all of these, and then make sure we do not have any more missing names.
name_mapping = {
"0500000US51161": {"NAME": "Roanoke County"},
"0500000US51770": {"NAME": "Roanoke City"},
"0500000US24510": {"NAME": "Baltimore City"},
"0500000US29189": {"NAME": "St. Louis County"},
"0500000US51059": {"NAME": "Fairfax County"},
"0500000US24005": {"NAME": "Baltimore County"},
"0500000US51019": {"NAME": "Bedford County"},
"0500000US35013": {"NAME": "DoÃ\x83±a Ana"},
"0500000US51159": {"NAME": "Richmond County"},
"0500000US51760": {"NAME": "Richmond City"},
"0500000US29510": {"NAME": "St. Louis City"},
"0500000US29186": {"NAME": "Ste Genevieve"},
"0500000US51067": {"NAME": "Franklin County"},
}
def update_geojson_feature_properties(feature, properties):
feature['properties'].update(properties)
for feature in county_geo_data['features']:
census_area = feature['properties'].get('GEO_ID', '')
if census_area in name_mapping:
update_geojson_feature_properties(feature, name_mapping[census_area])
geojson_names = set(feature['properties']['NAME'] for feature in county_geo_data['features'])
df_names = set(df['county_name'])
missing_names = df_names - geojson_names
print("Missing county names:", missing_names)Missing county names: set()
Note, I want to change the cleaning to be in methods and create a separate section for the exploration called Results. I made the mistake of not doing this prior, but I will switch these up in the final project.
Discussion
This section will demonstrate more data visualizations and include analysis of such to capture the big picture and findings towards the research question now that we understand the dataset.
[Include Introduction Here, Including Hypothesis]
Data Visualization #1
df.plot.scatter(x='ec_county', y='exposure_grp_mem_county')
plt.xlabel("Economic Connectedness between low and high SES")
plt.ylabel("Exposure of Population to low and high SES")
plt.title("Relationship between Exposure to Connection between low and high SES")
plt.show()
Include analysis here.
Data Visualization #2
df.plot.scatter(x='ec_county', y='clustering_county')
plt.xlabel("Economic Connectedness between low and high SES")
plt.ylabel("Level of Cohesiveness")
plt.title("Relationship between relationship connectedness between SES and overall cohesiveness in a county")
plt.show()
Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.
Data Visualization #3
Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.
Data Visualization #4
m = folium.Map(location=[36, -98], zoom_start=4)
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'clustering_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
name='Cohesiveness Levels',
).add_to(m)
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'ec_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
name='Connectedness Levels',
).add_to(m)
folium.LayerControl().add_to(m)
mInclude analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.
Data Visualization #5
m = folium.Map(location=[36, -98], zoom_start=4)
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'clustering_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
name='Cohesiveness Levels',
).add_to(m)
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'ec_high_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
name='High-SES Connectedness',
).add_to(m)
folium.LayerControl().add_to(m)
mInclude analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.
Data Visualization #6
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(df['clustering_county'], bins=10, range=(0.071, 0.26), edgecolor='black')
plt.xlabel("Cohesiveness")
plt.ylabel("Frequency")
plt.title("Frequency of Cohesiveness")
plt.subplot(1, 2, 2)
plt.hist(df['volunteering_rate_county'], bins=10, range=(0.00965, 0.308), edgecolor='black')
plt.xlabel("Volunteer Rate")
plt.ylabel("Frequency")
plt.title("Frequency of Volunteer Rate")
plt.subplots_adjust(wspace=0.4)
plt.show()
Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.
Data Visualization #7
df.plot.scatter(x='clustering_county', y='volunteering_rate_county')
plt.xlabel("Cohesiveness Within County")
plt.ylabel("Volunteering Rate Within County")
plt.title("Relationship between Cohesiveness and Volunteer Rate")
plt.show()
Include analysis here. Mention how this doesn’t fall under the hypothesis that higher connectedness makes for higher cohesiveness.
Data Visualization #8
m = folium.Map(location=[36, -98], zoom_start=4)
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'clustering_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
name='Cohesiveness Levels',
).add_to(m)
folium.Choropleth(
geo_data=county_geo_url,
data=df,
columns=['county_name', 'volunteering_rate_county'],
key_on='feature.properties.NAME',
fill_color='YlOrRd',
name='Volunteering Rate',
).add_to(m)
folium.LayerControl().add_to(m)
mConclusion
Conclusion of discussion to be written here.